[SPARK-12913] [SQL] Improve performance of stat functions #10960
|
cc @mengxr |
|
Test build #50240 has started for PR 10960 at commit |
Creating a Cast() here is very expensive
|
Test build #50265 has finished for PR 10960 at commit
|
Those if branches are important to save computation for low-order statistics. Even if we won't use CentralMomentAgg for second-order statistics, it is still good to keep them.
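To illustrate the point about the if branches, here is a minimal sketch in plain Python, not the Spark `CentralMomentAgg` code itself. It assumes a standard one-pass update for central moments; the names `update`, `moments`, and `max_order` are hypothetical and chosen only for this example. When only the mean or the variance is requested, the branches for higher moments are skipped entirely.

```python
def update(state, x, max_order):
    """Fold one value into (n, mean, m2, m3).

    Branches above max_order are skipped, so low-order statistics
    do not pay for higher-order arithmetic.
    """
    n, mean, m2, m3 = state
    n += 1
    delta = x - mean
    delta_n = delta / n
    mean += delta_n
    if max_order >= 2:
        term = delta * delta_n * (n - 1)
        if max_order >= 3:
            # m3 must be updated before m2, because it uses the old m2.
            m3 += term * delta_n * (n - 2) - 3.0 * delta_n * m2
        m2 += term
    return n, mean, m2, m3


def moments(values, max_order):
    """Run the update over all values, starting from an empty state."""
    state = (0, 0.0, 0.0, 0.0)
    for x in values:
        state = update(state, x, max_order)
    return state


data = [1.0, 2.0, 3.0, 4.0, 10.0]
full = moments(data, max_order=3)       # mean, m2 and m3 are all maintained
mean_only = moments(data, max_order=1)  # m2 and m3 stay at their initial 0.0
```

With `max_order=1` the loop performs only the mean update, which is the saving the comment refers to.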
|
@davies Did you get a chance to test whole-stage codegen with higher-order statistics like skewness? If it works, the cleanest solution would be changing |
|
Test build #50294 has finished for PR 10960 at commit
|
|
Test build #50297 has finished for PR 10960 at commit
|
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala
|
Test build #50304 has finished for PR 10960 at commit
|
|
@davies side note: The JIRA number is wrong. |
|
Test build #50322 has finished for PR 10960 at commit
|
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala
|
Test build #50384 has finished for PR 10960 at commit
|
The new implementation of Corr/Covar has better accuracy, so I updated the tests to match.
It would be nice if query tests could tolerate small numerical differences, but that is out of scope here.
|
Test build #50386 has finished for PR 10960 at commit
|
ditto on EqualTo => ===
|
@davies I made one pass. It would be nice to have a JIRA for checking query results with a tolerance for numerical differences, because the result might change (though this is unlikely) if we merge the partial results in a different order. |
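The concern about merge order can be sketched in plain Python. This is an illustration under stated assumptions, not the code in this PR: it assumes each partition keeps a `(n, mean, m2)` state and that states are combined with the usual pairwise formula for variance. The helper names `partial`, `merge`, and `sample_variance` are hypothetical. Because floating-point addition is not associative, the two merge orders may differ in the last bits, so the comparison uses a relative tolerance rather than exact equality.

```python
import math
from functools import reduce


def partial(values):
    """Return (n, mean, m2) for one partition using a one-pass update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge(a, b):
    """Combine two partial states with the pairwise update formula."""
    n1, mean1, m2_1 = a
    n2, mean2, m2_2 = b
    n = n1 + n2
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return n, mean, m2


def sample_variance(state):
    n, _, m2 = state
    return m2 / (n - 1)


# Three made-up partitions with very different magnitudes.
partitions = [[0.1, 0.2, 0.3], [1e6 + 0.1, 1e6 + 0.2], [-3.5, 7.25, 2.0, 0.7]]
states = [partial(p) for p in partitions]

forward = sample_variance(reduce(merge, states))
backward = sample_variance(reduce(merge, reversed(states)))

# Compare with a tolerance instead of ==, as suggested for query tests.
agree = math.isclose(forward, backward, rel_tol=1e-9)
```

The tolerance of `1e-9` is an arbitrary choice for the example; a real test helper would need a value suited to the data.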
|
@mengxr Thanks for reviewing this, I should have addressed all your comments. |
|
Test build #50560 has finished for PR 10960 at commit
|
|
test this please |
Sorry for some miscommunication. The previous inline comments are useful here because Lit(0.0) carries no information. The comments are not necessary when the variable names can clearly tell what they are. Please recover the inline comments for initial values.
There is no difference for these initial values, the order does not matter here. Do you still think we should keep those comments? Or should I change to use fill()?
|
Test build #50565 has finished for PR 10960 at commit
|
|
LGTM pending Jenkins. It is great to see a 5x speedup! |
|
Test build #50569 has finished for PR 10960 at commit
|
|
Merging this into master, thanks! |
|
Test build #50572 has finished for PR 10960 at commit
|
As benchmarked and discussed here: https://github.com/apache/spark/pull/10786/files#r50038294, a declarative aggregate function benefits from codegen and can be much faster than an imperative one.